Tennis is one of my favorite sports and although I don’t play very well, I do watch some tennis matches. Roger Federer is my favorite player and he’s had a big resurgence this year, which has got people wondering: Has he changed his game? What is he doing differently? I had seen the ATP dataset on Kaggle and thought I’d use it for my project. I picked the data from 2012 because I wanted a season far enough back for me to have forgotten what actually happened in it. So who won the most matches that year? What are the factors that influence players winning matches? This report explores a dataset containing results for tennis players on the ATP tour in 2012.
The dataset consists of 49 variables and 3025 observations. Each observation represents a match played on the ATP Tennis Tour in 2012.
As each observation is a match, most variables come in twos: one for the winner and one for the loser. This is why there are 49 columns but only 30 distinct variables: 11 appear once and 19 are doubled (11 + 2 × 19 = 49).
The singular variables are:
The duplicated variables are (prefixed with winner/w or loser/l):
## 'data.frame': 3025 obs. of 49 variables:
## $ tourney_id : Factor w/ 148 levels "2012-1536","2012-1720",..: 28 28 28 28 28 28 28 28 28 28 ...
## $ tourney_name : Factor w/ 148 levels "Acapulco","Atlanta",..: 145 145 145 145 145 145 145 145 145 145 ...
## $ surface : Factor w/ 4 levels "Carpet","Clay",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ draw_size : int 32 32 32 32 32 32 32 32 32 32 ...
## $ tourney_level : Factor w/ 6 levels "A","C","D","F",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ tourney_date : int 20120730 20120730 20120730 20120730 20120730 20120730 20120730 20120730 20120730 20120730 ...
## $ match_num : int 1 2 3 4 5 6 7 8 9 10 ...
## $ winner_id : int 103888 105575 103598 104871 103163 104919 104594 104735 105023 103794 ...
## $ winner_seed : int 1 NA NA 6 4 NA NA NA 8 NA ...
## $ winner_entry : Factor w/ 4 levels "","LL","Q","WC": 1 4 1 1 1 1 1 1 1 1 ...
## $ winner_name : Factor w/ 305 levels "Adam Kellner",..: 193 240 298 142 281 178 195 279 260 33 ...
## $ winner_hand : Factor w/ 3 levels "L","R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ winner_ht : int 188 175 185 188 188 188 194 180 198 178 ...
## $ winner_ioc : Factor w/ 74 levels "ALG","ARG","AUS",..: 72 42 6 26 28 2 3 28 72 28 ...
## $ winner_age : num 30.6 22.1 32 25.5 34.3 ...
## $ winner_rank : int 15 103 67 47 36 63 64 107 38 105 ...
## $ winner_rank_points: int 1955 549 686 899 1138 707 706 538 1029 543 ...
## $ loser_id : int 103451 103917 103908 104273 103188 104198 105028 105332 104214 105449 ...
## $ loser_seed : int NA NA NA NA NA NA NA 7 NA NA ...
## $ loser_entry : Factor w/ 4 levels "","LL","Q","WC": 1 1 1 1 3 1 3 1 1 4 ...
## $ loser_name : Factor w/ 428 levels "Adam Kellner",..: 56 308 322 112 289 155 198 54 165 387 ...
## $ loser_hand : Factor w/ 3 levels "L","R","U": 2 2 2 2 2 2 1 2 2 2 ...
## $ loser_ht : int 175 190 185 188 173 188 175 196 185 188 ...
## $ loser_ioc : Factor w/ 81 levels "ALG","ARG","AUS",..: 29 27 27 27 78 24 12 27 62 78 ...
## $ loser_age : num 32.8 30.5 30.5 28.7 34.2 ...
## $ loser_rank : int 77 62 142 100 81 71 89 50 92 360 ...
## $ loser_rank_points : int 627 744 387 555 615 651 585 880 579 115 ...
## $ score : Factor w/ 1286 levels " W/O","0-6 6-2 7-5",..: 180 848 186 160 794 1237 166 691 1112 805 ...
## $ best_of : int 3 3 3 3 3 3 3 3 3 3 ...
## $ round : Factor w/ 9 levels "BR","F","QF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ minutes : int 113 102 105 131 75 119 134 71 133 73 ...
## $ w_ace : int 15 10 12 19 5 6 5 0 18 7 ...
## $ w_df : int 3 5 3 5 3 4 1 4 10 2 ...
## $ w_svpt : int 82 68 70 102 45 77 95 60 113 59 ...
## $ w_1stIn : int 49 33 35 64 27 48 62 33 61 28 ...
## $ w_1stWon : int 34 28 30 50 23 39 43 22 51 20 ...
## $ w_2ndWon : int 22 20 16 19 13 18 20 8 29 18 ...
## $ w_SvGms : int 13 11 12 15 9 12 16 10 18 9 ...
## $ w_bpSaved : int 6 0 1 5 0 0 3 2 4 3 ...
## $ w_bpFaced : int 7 0 4 6 0 0 6 7 6 4 ...
## $ l_ace : int 5 10 4 9 5 8 5 5 8 7 ...
## $ l_df : int 5 1 5 7 3 5 3 3 1 3 ...
## $ l_svpt : int 81 90 87 103 61 78 87 65 112 57 ...
## $ l_1stIn : int 47 55 43 68 41 40 60 35 70 30 ...
## $ l_1stWon : int 31 39 26 57 28 30 42 20 56 21 ...
## $ l_2ndWon : int 15 15 17 14 8 25 15 5 23 11 ...
## $ l_SvGms : int 12 11 13 16 9 12 14 11 18 10 ...
## $ l_bpSaved : int 5 11 12 3 4 0 3 1 1 5 ...
## $ l_bpFaced : int 9 13 19 4 7 0 5 9 3 9 ...
As previously noted, each observation is a match between two players. My first interest is the number of wins for each player, which is not a variable! However, I should be able to get that from the winner_name column.
There are 305 distinct match winners, so the x-axis is a bit overcrowded. I’ll look at a summary of the winner_name column and use that to subset players.
Running summary on the winner_name column gives me the counts (wins) for each player; however, it doesn’t display median or quantile values. I’ll create a dataframe with the players and count the number of wins so I can have a better look at the distribution of the data.
## 'data.frame': 305 obs. of 2 variables:
## $ player_name: Factor w/ 305 levels "Adam Kellner",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ wins : int 1 1 4 1 2 7 26 14 1 14 ...
## player_name wins
## Adam Kellner : 1 Min. : 1.000
## Adrian Mannarino : 1 1st Qu.: 1.000
## Adrian Ungur : 1 Median : 3.000
## Aisam Ul Haq Qureshi: 1 Mean : 9.918
## Albano Olivetti : 1 3rd Qu.:13.000
## Albert Montanes : 1 Max. :76.000
## (Other) :299
Looking at the summary of wins, we have a median of 3 and a mean of about 10 wins per player. 75% of the players have 13 wins or fewer. If I ordered the data, it would have a very long tail.
At least 25% of players had only one win, so I will start by plotting players with more than 3 wins, which is the median.
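The counting step described above can be sketched in base R as follows. This is a minimal sketch on toy data, not the code used for this report: the `matches` data frame and its player names are made up, and only the winner_name, player_name and wins column names come from the dataset and output shown here.

```r
# Toy stand-in for atp_2012 (hypothetical data); only winner_name is needed.
matches <- data.frame(
  winner_name = c("Player A", "Player B", "Player A", "Player C", "Player A"),
  stringsAsFactors = TRUE
)

# table() counts how often each name appears, i.e. the wins per player.
player_wins <- as.data.frame(table(matches$winner_name),
                             stringsAsFactors = FALSE)
names(player_wins) <- c("player_name", "wins")

# Order the rows so the players with the most wins come first.
player_wins <- player_wins[order(-player_wins$wins), ]

# Unlike summary() on the factor, this reports median and quartiles of wins.
summary(player_wins$wins)
```

Subsetting for a plot is then a one-liner such as `player_wins[player_wins$wins > median(player_wins$wins), ]`.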
Looking at the peaks in this graph, you can see that a few players have a large number of wins. It’s a bit difficult to scan along the lines and pick out the highest peaks. I think flipping the same data on the side and ordering with the highest number of wins first might help. I’ll also decrease the sample of data so we can see the highest winners more clearly. I’ll look at players with more than the average number of wins.
The players with the highest number of wins are clearer to see this way. The players with the top 10 wins from 2012 are:
David Ferrer won the highest number of matches and Rafa Nadal is missing from the top 10. These two unexpected results could be down to the number of matches played - David Ferrer might have played far more matches than everyone else. I’ll have a look at the number of matches played and also check the number of wins per match played.
The win to matches played ratio might shed a better light on who wins most of the matches they play.
Again, there’s no column for this so I’ll have to make one. I’ll need to count the player names from both the winner_name and loser_name columns to get the total number of matches.
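One way to build that column is sketched below in base R. It is only an illustration on made-up data: the `matches` and `player_stats` names are assumptions, and the win ratio is expressed as a percentage to match the win_ratio values that appear later in this report.

```r
# Toy stand-in for atp_2012 (hypothetical players and results).
matches <- data.frame(
  winner_name = c("Player A", "Player B", "Player A", "Player C"),
  loser_name  = c("Player B", "Player C", "Player C", "Player D"),
  stringsAsFactors = TRUE
)

# winner_name and loser_name are factors with different levels, so convert
# to character and build one shared list of every player who appeared.
winners <- as.character(matches$winner_name)
losers  <- as.character(matches$loser_name)
players <- sort(unique(c(winners, losers)))

# Counting against the shared levels keeps players with zero wins (or losses).
wins   <- table(factor(winners, levels = players))
losses <- table(factor(losers,  levels = players))

player_stats <- data.frame(
  player_name   = players,
  wins          = as.integer(wins),
  total_matches = as.integer(wins + losses),
  stringsAsFactors = FALSE
)

# Win ratio as a percentage of matches played.
player_stats$win_ratio <- 100 * player_stats$wins / player_stats$total_matches
```

Using a shared set of factor levels avoids a merge step and makes sure players who only ever lost still get a row with zero wins.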
Using the win ratio (for players with more than 13 wins), the data looks more normalised than before. I used the number of wins as the label so the win ratio could easily be compared with the number of wins. We can see that although David Ferrer has a high win ratio, he didn’t have the best, with Novak Djokovic, Roger Federer and Rafa Nadal having higher win ratios. I wondered about Rafa Nadal when he didn’t appear in the top 10 for wins, but his win ratio was second best!
To the left is Jerzy Janowicz, who has a higher win ratio than some of the players in the top 10 for wins, such as Tomas Berdych and Nicolas Almagro, even though they had more wins.
Is the win ratio a better judge of a player’s ability to win matches than the number of wins? Or is it slightly affected by the number of matches played? Next, I’ll take a look at who had the most tournament wins: will it be players with the most wins or players with the highest win ratio?
There’s actually no variable in the dataset showing who won tournaments, but the round variable marks the final (F), and the winner of the final is the tournament winner.
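As a rough sketch of that idea in base R, the finals can be filtered out and their winners counted. The data below is invented, and `finals` and `tournament_wins` are illustrative names; the sketch assumes the final is coded as "F" in the round column, which is one of the levels shown in the structure output.

```r
# Toy stand-in for atp_2012 (hypothetical events and players).
matches <- data.frame(
  tourney_name = c("Event X", "Event X", "Event Y", "Event Y", "Event Z"),
  round        = c("SF", "F", "QF", "F", "F"),
  winner_name  = c("Player A", "Player A", "Player B", "Player C", "Player A"),
  surface      = c("Clay", "Clay", "Hard", "Hard", "Grass"),
  stringsAsFactors = FALSE
)

# Keep only finals: the winner of each final won that tournament.
finals <- matches[matches$round == "F", ]

# Count titles per player, most titles first.
tournament_wins <- as.data.frame(table(finals$winner_name),
                                 stringsAsFactors = FALSE)
names(tournament_wins) <- c("player_name", "tournament_wins")
tournament_wins <- tournament_wins[order(-tournament_wins$tournament_wins), ]
```

Because `finals` keeps the surface column, the same subset can be reused to see which surface each title was won on.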
It appears that the number of matches won corresponds largely with the number of tournament wins. This is also true of the win ratio: the 5 players with the highest win ratio were also among the top 7 players with the highest number of tournament wins. However, Rafa Nadal winning 4 tournaments despite having played fewer matches suggests that win ratio might be a better indicator of who would win more tournaments. Also here is Juan Monaco, who isn’t in the top 10 for wins or the top 10 for win_ratio, yet he won 4 tournaments.
The tours_win dataset has surface and tourney_level variables, which show the type of surface the tournament was won on and the level of the tournament.
Juan Monaco won all his tournaments at the A Tour level and also won 3 of his 4 tournaments on Clay. I can also see Rafa Nadal only won tournaments on Clay. Novak Djokovic won all 6 of his titles on Hard courts. David Ferrer and Roger Federer were the only players to win across all surfaces. It’s still not too apparent why Juan Monaco was able to win that many tournaments without being in the top 10 for wins or win_ratio, but it would appear that some players do perform very well on particular surfaces. What were the percentage wins of these players on the various surfaces?
This shows the percentage of each player’s total wins that came on each surface. So for example, 55% of Rafa Nadal’s wins came on Clay and 67% of Novak Djokovic’s wins came on clay. We can see some players like Nicolas Almagro, Thomaz Bellucci and Juan Monaco having a large percentage of their wins on Clay. Also, Sam Querrey, Jurgen Melzer and Alexandr Dolgopolov had a large percentage of their wins on Hard. This does show that some players tend to win more on particular surfaces. However, they could also have played more matches on these surfaces because they preferred them. I would like to see their win_ratio per surface, i.e. for the number of matches they played on each surface, what was their win_ratio?
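A per-surface win ratio of this kind can be sketched in base R by reshaping the matches so each player gets one row per match played. This is an illustration on invented data; `long` and `surface_ratio` are hypothetical names, not objects from this report.

```r
# Toy stand-in for atp_2012 (hypothetical players and results).
matches <- data.frame(
  surface     = c("Clay", "Clay", "Hard", "Hard", "Clay"),
  winner_name = c("Player A", "Player A", "Player B", "Player A", "Player B"),
  loser_name  = c("Player B", "Player B", "Player A", "Player B", "Player A"),
  stringsAsFactors = FALSE
)

# One row per player per match: won = 1 for the winner, 0 for the loser.
long <- rbind(
  data.frame(player_name = matches$winner_name,
             surface = matches$surface, won = 1),
  data.frame(player_name = matches$loser_name,
             surface = matches$surface, won = 0)
)

# The mean of a 0/1 column is the share of matches won on that surface.
surface_ratio <- aggregate(won ~ player_name + surface, data = long, FUN = mean)
surface_ratio$win_ratio <- 100 * surface_ratio$won
```

The difference from the previous plot is the denominator: matches played on the surface rather than the player’s total wins.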
OK… This looks more like what I wanted to see for the tournament winners. The y-axis has a high percentage range because it stacks the win ratio for each of the 3 surfaces, so the maximum value is 300%. Rafa Nadal, with 4 Clay tournaments, won 96% of all matches he played on Clay. Novak Djokovic won 91% of the matches he played on Hard courts. Roger Federer had high percentages across all surfaces, as did David Ferrer. Juan Monaco, with 78% on Clay, won more tournaments on Clay than Tomas Berdych, Roger Federer and Novak Djokovic, who all had higher win_ratios on the surface; not many players had a higher win ratio on Clay than he did. Novak and Roger’s high win_ratios on Hard did translate to the highest numbers of tournament wins on that surface.
There are some players, like Tomas Berdych, who had high win ratios on Clay or Hard, for instance, that didn’t quite translate to tournament wins. Rafa Nadal also had a high win ratio on Hard courts, behind only Roger and Novak, yet he didn’t win any Hard court tournaments. These players might have been getting to a lot of finals or semi-finals but not actually winning the tournament.
I’ll add tournament_wins for each player to the merged_count_wins so I can build a new dataframe of variables to test correlations with.
str(atp_player_data)
## 'data.frame': 457 obs. of 6 variables:
## $ player_name : Factor w/ 457 levels "Adam Kellner",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ player_id : int 104775 105173 104494 103529 105874 103656 105077 104268 105561 104166 ...
## $ total_matches : int 2 8 18 2 4 23 56 39 5 46 ...
## $ wins : num 1 1 4 1 2 7 26 14 1 14 ...
## $ win_ratio : num 50 12.5 22.2 50 50 ...
## $ tournament_wins: num 0 0 0 0 0 0 0 0 0 0 ...
Now that that’s complete, I’d like to take a quick look at wins across the different levels of tournament.
The ATP tour website describes these tournament level types:
So let’s look at the tourney_level variable.
summary(atp_2012$tourney_level)
## A C D F G M
## 1616 15 312 15 508 559
Clearly, there were more matches at the general ATP level (A) than at any other, with the next highest being Masters (M) and Grand Slams (G).
## 'data.frame': 1830 obs. of 3 variables:
## $ player_name : Factor w/ 305 levels "Adam Kellner",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Tournament_Level: Factor w/ 6 levels "A","C","D","F",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Total_wins : int 0 1 0 0 2 6 20 7 1 12 ...
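A table shaped like the one above, with one row per player and tournament level, can be sketched by cross-tabulating winners against levels. The data here is invented and `level_wins` is a hypothetical name; only the three column names are taken from the output above. Note that 1830 rows is consistent with 305 winners × 6 levels, since every combination gets a row, including those with zero wins.

```r
# Toy stand-in for atp_2012 (hypothetical players, three of the six levels).
matches <- data.frame(
  winner_name   = factor(c("Player A", "Player A", "Player B", "Player A")),
  tourney_level = factor(c("A", "G", "A", "M"), levels = c("A", "G", "M"))
)

# A two-way table has a cell for every player/level pair, zeros included.
level_wins <- as.data.frame(table(matches$winner_name, matches$tourney_level))
names(level_wins) <- c("player_name", "Tournament_Level", "Total_wins")

# Players whose wins all came at the general ATP level (A).
wins_elsewhere <- tapply(
  level_wins$Total_wins * (level_wins$Tournament_Level != "A"),
  level_wins$player_name, sum
)
only_level_a <- names(wins_elsewhere)[wins_elsewhere == 0]
```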
I subsetted the data to only look at players who had more than 5 wins on the Tour that year and I noticed a few things:
Most players on Tour have a lot of their wins at the ATP tournament level, with close to 70% of these players having wins only at that level. Although it’s not surprising that most players have more wins at this level, I didn’t expect that quite so many would have wins only at this level.
Players with higher number of wins, have a good mix of Masters and Grandslam wins.
Next, I’d like to look at winner_seed. This shows the seed (rank within the tournament) of the player who won each match.
summary(atp_2012$winner_seed)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 5.000 7.017 9.000 32.000 1661
## int [1:3025] 1 NA NA 6 4 NA NA NA 8 NA ...
Looking at this summary, I was a bit thrown for a moment by the large number of NAs. However, a seed is different from a ranking, and only about a quarter of the players in a draw are seeded. So a draw size of 128 would have 32 seeds, a draw size of 64 would have 16 seeds, and so on.
This distribution is skewed to the right… There are far more match winners seeded 1-8, with most matches being won by the player seeded number one for the tournament. In sum, higher-seeded players win more matches than lower-seeded players. Quel (not) surprise! What about tournaments? Are seeds more likely to win tournaments?
Wow! In 2012, seeds were more likely to win tournaments, especially those seeded 1-8! Only 6 tournaments were won by unseeded players, and I bet they were all ATP A-level tournaments as well!
Yes, all the tournaments won by unseeded players were general ATP level A tournaments. For the Masters and Grand Slams, the seeds ranked 1-4 won all the tournaments at that level. That’s some consistency!
On the ATP tour in 2012, seeded players were most likely to win tournaments.
So far, I’ve looked at the number of wins and the win ratio for the players. I noticed that the win_ratio was likely a better determinant of the ability of a player to win matches as it showed the number of wins per matches played. Also, the players with the most wins and highest win ratio won the most number of tournaments.
Also, some players did very well on certain surfaces, having a very high win ratio on those surfaces, and in most cases that translated to tournament wins. However, some players had high win ratios on certain surfaces with no tournament wins to show for it.
One of the most interesting observations was that most players with more than 5 wins only won matches at the ATP level. I expected there to be a better distribution of wins across tournament types. It seemed that a few players were winning most matches at the higher tournament levels. This was more noticeable in the number of tournaments won by seeded players at the higher tournament levels. Unseeded players won only ATP level tournaments.
It’s time to look at variables that involve both the winners and losers of matches. Are there any variables that stand out or can show the differences between winners and losers?
Variables:
The values for these 4 variables are absolute values and could be skewed depending on how long the match was or how many sets the players had to play. It would be better to calculate the percentage of first serve points won based on the total amount of first serves put in play and the same with the second serve.
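A sketch of that calculation in base R follows. The two rows are the first two matches from the structure output at the top of this report; the `pct` helper is a hypothetical name, and the choice of denominator for the second serve (service points minus first serves in, which counts double faults as second serve points lost) is an assumption rather than a confirmed detail of this analysis.

```r
# First two matches from the structure output above.
matches <- data.frame(
  w_svpt = c(82, 68), w_1stIn = c(49, 33), w_1stWon = c(34, 28), w_2ndWon = c(22, 20),
  l_svpt = c(81, 90), l_1stIn = c(47, 55), l_1stWon = c(31, 39), l_2ndWon = c(15, 15)
)

# Percentage helper: NA instead of a division by zero when nothing was played.
pct <- function(won, played) ifelse(played > 0, 100 * won / played, NA_real_)

# First serve: points won out of first serves that landed in.
matches$w_1stpwon <- pct(matches$w_1stWon, matches$w_1stIn)
matches$l_1stpwon <- pct(matches$l_1stWon, matches$l_1stIn)

# Second serve (assumed denominator): service points that needed a second serve.
matches$w_2ndpwon <- pct(matches$w_2ndWon, matches$w_svpt - matches$w_1stIn)
matches$l_2ndpwon <- pct(matches$l_2ndWon, matches$l_svpt - matches$l_1stIn)

round(matches[, c("w_1stpwon", "l_1stpwon", "w_2ndpwon", "l_2ndpwon")], 2)
```

Rows with missing serve statistics stay NA, which is in line with the NA counts reported in the summaries.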
## player_type fs_percentage_won
## Length:6050 Min. : 18.18
## Class :character 1st Qu.: 64.29
## Mode :character Median : 71.21
## Mean : 70.91
## 3rd Qu.: 78.05
## Max. :100.00
## NA's :688
Looking at the histogram, the percentage of first serve points won appears to be normally distributed for both winners and losers. I plotted the winner and loser first serve percentages on the same graph so the differences would be easier to see immediately.
The majority of the first serve percentage points won fall between 60% and 90%. However, we can see that the green colour extends further to the left, which indicates a lower percentage of points won on the first serve by players who lose. Also, the pinkish colour extends further to the right, with all winners of matches winning 50% or more of their first serve points.
The box plot shows that the median first serve percentage points won for match winners is roughly 10 points higher than that for match losers. The upper quartile is also higher for winners than for losers. On average, winners win a higher percentage of their first serve points.
The high_fsperc_won is a new dataframe with first serve percentage points won for both winners and losers. Each observation is a match, and I checked who had the higher first serve percentage points won in the match. I then recorded whether it was the winner or loser of the match in the player_type column.
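The per-match comparison can be sketched as below. The percentages are made-up values, `serve` is a hypothetical stand-in for high_fsperc_won, and the "winner"/"loser" labels and the handling of ties (counted under "loser" here) are assumptions.

```r
# Toy stand-in for high_fsperc_won (hypothetical percentages).
serve <- data.frame(
  w_1stpwon = c(69.4, 84.8, 60.0, NA),
  l_1stpwon = c(66.0, 70.9, 72.5, 65.0)
)

# Label each match by who won the higher share of first serve points.
# A missing percentage on either side leaves the label as NA.
serve$player_type <- ifelse(serve$w_1stpwon > serve$l_1stpwon,
                            "winner", "loser")

# Counts behind a bar graph of the two groups, with NAs shown separately.
table(serve$player_type, useNA = "ifany")
```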
## w_1stpwon l_1stpwon player_type
## Min. : 50.00 Min. : 18.18 Length:3025
## 1st Qu.: 70.91 1st Qu.: 59.52 Class :character
## Median : 76.47 Median : 65.71 Mode :character
## Mean : 76.62 Mean : 65.21
## 3rd Qu.: 82.14 3rd Qu.: 71.70
## Max. :100.00 Max. :100.00
## NA's :344 NA's :344
The bar graph shows that a player is 4 times as likely to win a match when they win a higher percentage of first serve points than their opponent. Of the matches played, over 2500 were won by the player with the higher first serve percentage points won.
## player_type ss_percentage_won
## Length:6050 Min. : 0.00
## Class :character 1st Qu.: 48.28
## Mode :character Median : 56.25
## Mean : 56.16
## 3rd Qu.: 64.00
## Max. :100.00
## NA's :688
The median second serve percentage won for winners is roughly 10 points higher than that for the losers. Also, the histogram shows the winners - shaded pink - with more values to the right, indicating that players who won matches had won more points on their second serve. The mean and median of all second serve percentage points won are 56.16 and 56.25 respectively; looking at the box plot, the majority of match winners had values higher than this.
In tennis matches, players get two chances to serve for each point. You only get a second serve if you miss the first. If the serving player misses the second serve as well, the opposing player is automatically awarded the point. Players usually have very fast and powerful first serves, making it harder for the opposing player to return the serve. Also, players take more risks on the first serve because they know they have another chance. On the second serve, however, a miss means the opposing player wins the point, so players take less risk and the serve is therefore slower and less powerful. This gives the opposing player a better chance to attack the serve and win more points off it.
This is most likely the reason why we have this large difference in second serve percentage points won between winners and losers. The players who are able to win more points on their (weaker) second serves win more matches.
## w_2ndpwon l_2ndpwon player_type
## Min. : 18.18 Min. : 0.00 Length:3025
## 1st Qu.: 54.84 1st Qu.: 43.75 Class :character
## Median : 61.54 Median : 50.00 Mode :character
## Mean : 61.97 Mean : 50.35
## 3rd Qu.: 68.75 3rd Qu.: 57.69
## Max. :100.00 Max. :100.00
## NA's :344 NA's :344
Again, it can be seen here that when a player has a higher second serve percentage points won, he/she is more likely to win the match - almost 4 times as likely!
Break points faced shows the number of opportunities the opponent had to break serve. Each time a player wins a game in which they are serving, they hold serve and are awarded that game; if the opposing player wins it instead (i.e. breaks the server’s serve), the game is awarded to the opponent. A break point is a point which, if won by the returner, wins them the game.
The variables are:
## player_type bp_faced
## Length:6050 Min. : 0.000
## Class :character 1st Qu.: 3.000
## Mode :character Median : 6.000
## Mean : 6.771
## 3rd Qu.: 9.000
## Max. :30.000
## NA's :688
The green shade for losers does edge to the right, indicating a higher number of opportunities presented to the opponent to break serve. The histogram for winners edges closer to the left, and we can see that players who won matches generally faced fewer break points.
The median for break points faced by winners is 4, while that for losers is about 8. 75% of the winners faced 7 break points or fewer, which is just slightly more than the median and mean for all players. This definitely makes sense, because the fewer opportunities your opponent has to break your serve, the less likely he/she is to break it. And if he/she can’t break your serve, it’s tougher for them to beat you.
It would be interesting to see the relationship between winners and the percentage of break points saved.
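The percentage of break points saved can be sketched as below, using the first three matches from the structure output at the top of this report. The `bp_saved_pct` helper and the `bp_long` name are hypothetical. A player who faced no break points has no defined percentage, so the sketch returns NA in that case; this may be why the summary of percentage_bpSaved shows more NAs than the other serve variables, though that is an inference rather than something confirmed here.

```r
# First three matches from the structure output above.
matches <- data.frame(
  w_bpSaved = c(6, 0, 1),   w_bpFaced = c(7, 0, 4),
  l_bpSaved = c(5, 11, 12), l_bpFaced = c(9, 13, 19)
)

# Saved / faced as a percentage; NA when no break points were faced.
bp_saved_pct <- function(saved, faced) {
  ifelse(faced > 0, 100 * saved / faced, NA_real_)
}

# Stack winners and losers so both can share one histogram or box plot.
bp_long <- data.frame(
  player_type = rep(c("winner", "loser"), each = nrow(matches)),
  percentage_bpSaved = c(bp_saved_pct(matches$w_bpSaved, matches$w_bpFaced),
                         bp_saved_pct(matches$l_bpSaved, matches$l_bpFaced))
)

summary(bp_long$percentage_bpSaved)
```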
## player_type percentage_bpSaved
## Length:6050 Min. : 0.00
## Class :character 1st Qu.: 44.44
## Mode :character Median : 60.00
## Mean : 58.26
## 3rd Qu.: 75.00
## Max. :100.00
## NA's :973
I expected a clearer distinction between winners and losers in general. However, one thing that was clear was that far more winners saved 100% of the break points they faced: over 500 winners. Again, if a player saves all the break points that they face, then they are very unlikely to lose the match. Those who saved 100% of break points and still lost their matches would have done so in tie-breaks.
In the sections above, I tried to work through the different variables to find those which could show me interesting patterns about the winners (and losers) on Tour in 2012. I’ll choose the 3 which I think were most interesting (and surprising) to me.
I include this plot because it was the plot that first showed the differences between match wins and win ratio. It also got me wondering how players did across different surfaces and how that might affect their tournament wins.
This plot goes further to show the percentage of matches won on each surface for players who won the most tournaments. It helped explain why some players, like Djokovic with a 91% win ratio on Hard courts, won more Hard court tournaments than anyone else - the same with Rafa Nadal on Clay - but it also raised some questions: some players with a high win ratio on certain surfaces didn’t win any tournaments on those surfaces.
Although I started out looking at winners of matches across the season, looking at variables that show what causes players to win individual matches proved to be very interesting. I really wanted to put the first and second serve percentage plots together here, but I had to go with one and I went with the first serve percentage. There was a clear distinction showing that match winners won 50% or more of their first serve points.
Due to the high number of players and variables, it was very difficult to use ggpairs to look at correlations, and I spent a lot of time investigating and then discarding certain variables. I also had to do a lot of data manipulation because the data was not always in the format that I needed.
Additionally, when I started out, I thought I might be able to build a model to predict the winners of matches, but I didn’t think I had enough winner or loser data; I would have liked to have a breakdown of how points were won (forehand winners, backhand winners, unforced errors, etc.) for winners and losers. Furthermore, the problem was very different from the price prediction example on the course: I would need to predict who would win matches against other players! Therefore each player would need some sort of score from the model to capture their ability not only to win matches but to beat a specific opponent.
I think I would need more data and more knowledge of Modelling with Categorical variables to create a model.
Starting out with this data from the ATP Tour in 2012, I was interested in discovering what really influenced players winning matches. A number of things stood out to me. Firstly, the number of tournaments won by seeds, especially at the Masters and Grand Slam levels. These tournaments are very difficult to win, and while it may make sense that the better players should naturally win them, the fact that they are tough to win means the same players usually can’t win most of them. In 2012, however, the top 4 seeds won most of the Grand Slams and Masters, and these were actually the same group of players. Being seeded, especially 1-4, meant a player was very likely to win matches and tournaments.
Secondly, in comparing winners and losers per match, it was clear that winning a higher percentage of first serve points than the opponent meant the player was more than 4 times as likely to win the match. What matters here is not just having a high value (over 50% in general) but having a higher value than the opponent.
Lastly, there were some players with high win_ratios per surface who didn’t quite turn those into tournament wins. There appear to be some intangibles here that are not recorded and that could be influencing the inability to cross the final hurdle - mental toughness? Experience? I would be interested in exploring this dataset further.